Abstract

Background: Interpretation of quantitative metagenomics data is important for understanding ecosystem
functioning and for assessing differences between various environmental samples. There is a need for an easy-to-use
tool to explore the often complex metagenomics data in a taxonomic and functional context.

Results: Here we introduce FANTOM, a tool that allows for exploratory and comparative analysis of metagenomics 
abundance data integrated with metadata information and biological databases. Importantly, FANTOM can make 
use of any hierarchical database and it comes supplied with NCBI taxonomic hierarchies as well as KEGG Orthology, 
COG, PFAM and TIGRFAM databases. 

Conclusions: The software is implemented in Python, is platform independent, and is available at
www.sysbio.se/Fantom.

Background

Metagenomics [1] is the culture-independent study of an
environmental sample by sequencing the recovered genetic
material, either targeted ribosomal RNA genes (16S) through
amplicon sequencing or whole genomic DNA. This allows the
ecosystem's taxonomic diversity, functional capacity and
dynamics to be determined and compared with other
environments. Typically, for whole-genome-based
metagenomics, DNA extracted from an environmental sample
is the starting material for generating short reads through
next generation sequencing (NGS) technologies; these reads
represent the microbiota of the sample. The raw sequence
reads typically contain errors that need to be eliminated
before further steps, using trimming and filtering based on
the base calling quality score (Phred) [2,3]. The high quality
reads can be annotated with reference taxonomic and
functional features using sequence-similarity-based alignment
methods, e.g. BLAST [4] or HMMER [5], against reference
databases. Another approach is to map the high quality reads
onto reference genomes or well annotated genes using short
read aligners [6]. Web services such as CAMERA [7],
IMG/M [8] and MG-RAST [9] perform the above mentioned
pipeline of NGS processing and annotation in an automated
fashion. Depending on user-given parameters such as
percentage similarity or e-value thresholds, each of these
software tools or web services reports the annotated
sequences as abundance data for each feature in the
database used. Further analysis of the resulting quantitative
abundance data of metagenomics features, in particular
together with sample metadata, is important for biological
interpretation [10,11].

Although the above mentioned web services can to some
extent provide tools for the comparative analysis of
metagenomes, they have some limitations: 1) statistical and
visual analysis capabilities are limited, 2) functional
annotation sources might not satisfy the user's demands,
and 3) users may simply not want to upload their sequencing
data to an online service. There are several standalone
software tools available for statistical analysis and
visualization of annotated metagenomics data, e.g. MEGAN
[12], SmashCommunity [13], STAMP [14],
shotgunFunctionalizeR [15], VEGAN [16], QIIME [17] and
Mothur [18].

We identified the requirement for a user-friendly comparative
analysis and data visualization tool in which annotated
metagenomics data can be combined with sample metadata
and analyzed at different hierarchy levels using a built-in or
user-provided biological database. This tool, FANTOM
(Functional ANnotation and Taxonomic analysis Of
Metagenomes), is an easily installed, standalone software
tool accessed through a graphical user interface. It analyzes
the abundance of metagenomics features, which are readily
integrated with NCBI taxonomy, KEGG [19], COG [20] and
the protein family databases PFAM [21] and TIGRFAM [22],
all with hierarchy information. We believe that this tool will
be highly useful for a broad community of scientists who
want to analyze metagenomics data.

Implementation 

The software installer, user manual and demonstration
videos can be downloaded from the website
www.sysbio.se/Fantom.

FANTOM was implemented in Python, which allows it to
run independently of the platform. It uses the core scientific
packages numpy, scipy and matplotlib to implement
statistical functions and various plotting options. wxPython
provides the graphical user interface components, and the
storm package is used for object relational mapping of data
from the local SQLite database. The software was tested
successfully on Windows, Linux and OSX operating systems,
and installers are provided for the different platforms.

FANTOM requires two input files: a metagenomics
abundance file, which can be derived from the annotation of
metagenomics data and contains either taxonomic or
functional annotations, and a file containing the samples'
metadata (see user manual and demonstration videos). Web
services such as CAMERA [7], IMG/M [8] and MG-RAST [9]
allow users to easily obtain metagenomics abundances from
their metagenome data. Metadata can be either numerical or
categorical; the software automatically recognizes the format
and displays options for selecting and filtering samples.
Functional hierarchy information was downloaded from the
KEGG Orthology, COG, PFAM and TIGRFAM databases, and
taxonomic lineage information was downloaded from the
NCBI taxonomy database; together these constitute the
standard feature databases in the software package.
Moreover, FANTOM allows the user to create and use a
custom-made hierarchical database. The custom database
can be imported as a tabular input file to analyze the
abundances at the corresponding database levels.
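
The two inputs described above can be illustrated with a minimal Python sketch. The tab-separated layout, the column names and the helper functions read_table and is_numeric are assumptions made for illustration only; the actual file formats are defined in the user manual, and this is not FANTOM's own parsing code.

```python
import csv
import io

# Hypothetical tab-separated inputs; the real layout is described in the user manual.
ABUNDANCE_TSV = "feature\tS1\tS2\tS3\nK00001\t10\t0\t4\nK00002\t3\t7\t1\n"
METADATA_TSV = "sample\tBMI\tcountry\nS1\t22.5\tSpain\nS2\t31.0\tDenmark\nS3\t27.3\tSpain\n"


def read_table(text):
    """Parse a tab-separated table into (column names, {row id: [cell strings]})."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    return header[1:], {row[0]: row[1:] for row in body}


def is_numeric(values):
    """Return True if every value parses as a float."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


# Abundance file: one row per feature, one column per sample.
samples, counts = read_table(ABUNDANCE_TSV)
abundance = {feat: [float(x) for x in cells] for feat, cells in counts.items()}

# Metadata file: one row per sample, one column per property.
fields, meta_rows = read_table(METADATA_TSV)

# Classify each metadata column as numerical or categorical.
column_types = {}
for i, field in enumerate(fields):
    column = [meta_rows[s][i] for s in samples]
    column_types[field] = "numerical" if is_numeric(column) else "categorical"

# Select samples with a metadata filter, here all samples from Spain.
country_index = fields.index("country")
spanish = [s for s in samples if meta_rows[s][country_index] == "Spain"]
```

Treating a column as numerical only when every entry parses as a number is one simple way to recognize the metadata type automatically; other rules are possible.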

In FANTOM, the abundance can be specified at different
levels of a hierarchical database, which are called nodes
(e.g. pathways or genera). The abundance of a higher node
in the hierarchy is calculated by summing the abundances of
all member nodes further down in the hierarchy structure
(e.g. orthologs or species). The abundance of a node that is
a member of more than one higher level node is split equally
between those higher nodes.
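
The summation and equal-split rule can be written as a short Python sketch. The node names, the PARENTS mapping and the aggregate function are hypothetical and only illustrate the rule for a single hierarchy level and a single sample; they are not taken from the FANTOM source code.

```python
from collections import defaultdict

# Hypothetical child -> parents mapping (e.g. orthologs -> pathways).
PARENTS = {
    "K1": ["pathA"],
    "K2": ["pathA", "pathB"],  # member of two higher level nodes
    "K3": ["pathB"],
}

# Abundance of each lower level node in one sample.
LEAF_ABUNDANCE = {"K1": 4.0, "K2": 6.0, "K3": 2.0}


def aggregate(leaf_abundance, parents):
    """Sum child abundances into parents, splitting shared children equally."""
    totals = defaultdict(float)
    for child, value in leaf_abundance.items():
        targets = parents.get(child, [])
        for parent in targets:
            # A child with n parents contributes value / n to each of them.
            totals[parent] += value / len(targets)
    return dict(totals)


pathway_abundance = aggregate(LEAF_ABUNDANCE, PARENTS)
```

With the equal split, the total abundance is conserved from one level to the next: K2 contributes 3.0 to each pathway, so pathA and pathB sum to the same total of 12.0 as the three orthologs.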

The metadata file can include both categorical and
numerical properties of each sample, which can then be
used in FANTOM to filter and select sample groups of
interest for comparative analysis. Numerical variables can
further be used for correlation analysis with the annotated
features. Taxonomic or functional feature abundances can be
displayed and processed either as absolute counts or as
normalized relative values. After selecting relevant subsets
of metagenomics data, principal component analysis can be
applied to reduce the dimensionality. Furthermore,
hierarchical clustering, another multivariate analysis
method, is implemented to evaluate high dimensional
metagenomics data by drawing dendrograms for features
and samples as well as a heatmap with 2-dimensional
clustering that reflects the abundance values.
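
The normalization and multivariate steps can be sketched with numpy and scipy, the packages named in the Implementation section. The count matrix is invented, and the choice of average linkage with Euclidean distance is an assumption; the text does not state which linkage or distance FANTOM uses.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Hypothetical count matrix: rows are samples, columns are features.
counts = np.array([
    [10.0, 0.0, 5.0, 5.0],
    [8.0, 2.0, 6.0, 4.0],
    [1.0, 12.0, 3.0, 4.0],
    [0.0, 15.0, 2.0, 3.0],
])

# Normalize each sample so that its feature abundances sum to 1.
relative = counts / counts.sum(axis=1, keepdims=True)

# PCA: center each feature, then take the singular value decomposition.
centered = relative - relative.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                     # sample coordinates on the principal components
explained = s**2 / np.sum(s**2)    # fraction of variance per component

# Hierarchical clustering of samples (rows) and of features (columns).
sample_tree = linkage(relative, method="average", metric="euclidean")
feature_tree = linkage(relative.T, method="average", metric="euclidean")

# The dendrogram leaf orders give the row and column order of a clustered heatmap.
row_order = leaves_list(sample_tree)
col_order = leaves_list(feature_tree)
heatmap_matrix = relative[np.ix_(row_order, col_order)]
```

The reordered heatmap_matrix is what a 2-dimensional clustered heatmap would display, with the two linkage trees drawn as dendrograms along its margins.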

By defining groups of samples based on metadata,
statistical hypothesis tests can be performed to compare
metagenomics features between groups. FANTOM currently
supports two-sample comparisons. The non-parametric
Mann-Whitney U test was implemented in FANTOM and is
encouraged because metagenomics data are typically not
normally distributed. The Shapiro-Wilk normality test,
Bartlett's test and Levene's test for equality of variances,
and Student's t-test as a parametric hypothesis test, were
also implemented. The p-values obtained from these tests
can be adjusted for multiple testing using either the
Bonferroni correction or the Benjamini-Hochberg false
discovery rate (FDR). Results can finally be filtered
according to p-value, absolute fold change and mean relative
abundance. The multivariate and statistical methods
provided in FANTOM are summarized in Figure 1.
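
A two-group comparison of this kind can be sketched as follows. The example uses scipy.stats.mannwhitneyu for the test and a hand-written Benjamini-Hochberg adjustment; the feature names, abundance values and thresholds are invented for illustration, and benjamini_hochberg is a helper defined here, not a FANTOM function.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Hypothetical relative abundances: rows are features, columns are samples.
features = ["featA", "featB", "featC"]
group1 = np.array([[0.30, 0.32, 0.35, 0.31, 0.33],
                   [0.10, 0.12, 0.09, 0.11, 0.10],
                   [0.05, 0.20, 0.07, 0.18, 0.10]])
group2 = np.array([[0.10, 0.12, 0.11, 0.09, 0.13],
                   [0.11, 0.10, 0.12, 0.09, 0.10],
                   [0.06, 0.19, 0.08, 0.17, 0.11]])

# One two-sided Mann-Whitney U test per feature.
pvalues = [mannwhitneyu(a, b, alternative="two-sided").pvalue
           for a, b in zip(group1, group2)]
qvalues = benjamini_hochberg(pvalues)

# Fold change of the group means (assumes non-zero means in group2).
fold_change = group1.mean(axis=1) / group2.mean(axis=1)

# Keep features passing an FDR threshold and an absolute fold-change threshold.
significant = [f for f, q, fc in zip(features, qvalues, fold_change)
               if q < 0.2 and max(fc, 1.0 / fc) >= 1.5]
```

In this toy data only featA separates the two groups completely, so it is the only feature that passes both filters.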

FANTOM provides several options for graphical
representation of the data and comparative analysis. After
hypothesis testing, significant results can be displayed as
bar charts, boxplots, pie charts and area plots. The plotting
options make use of the hierarchies in NCBI taxonomy,
KEGG and COG to group the metagenomics data according
to the specified level and the applied filtering options. The
software can save figures in high quality formats that can be
used directly for publication. An example screen shot of
FANTOM is shown in Figure 2.

Results and discussion 

The software was evaluated using metagenomics data
from the gut microbiome of 124 subjects in the MetaHIT [23]
project. Sequences were quality trimmed (SolexaQA -p 0.05)
and sequences shorter than 35 bp were filtered out. High
quality reads were aligned to a reference catalogue of 440
genomes to obtain taxonomic abundance. Moreover, the
reads were aligned to the MetaHIT gene catalogue of 3.3
million genes to obtain the abundance of genes. The genes
were annotated using the KEGG and COG databases, and
this information was used to transform gene abundance into
KEGG KO and COG abundances. These data, together with
the metadata, are available as example files bundled with
the software.

Sanli et al. BMC Bioinformatics 2013, 14:38
http://www.biomedcentral.com/1471-2105/14/38

Figure 1 Statistical analyses provided in FANTOM.

Multivariate analysis methods
• Principal component analysis (PCA)
• Clustering: hierarchical clustering, k-means clustering

Statistical hypothesis tests
• Parametric tests
  • Tests for equality of variances: Levene's test, Bartlett's test
  • Independent two-sample t-tests: Student's t-test (equal population variance), Welch's t-test (unequal population variance)
• Non-parametric test: Mann-Whitney U test

Multiple testing correction
• Bonferroni
• False discovery rate

Correlation analysis
• Pearson correlation
• Spearman's rank correlation

The MetaHIT study focused on two human diseases,
obesity and inflammatory bowel disease (Crohn's disease
and ulcerative colitis), which we use here to illustrate the
capabilities of FANTOM.

Differences based on the Mann-Whitney U test (FDR < 0.2)
were observed between lean (BMI < 25) and obese (BMI > 30)
individuals at the species and genus taxonomic levels. At
the genus level, Prevotella in particular was enriched in
obese individuals, whereas Bacteroides, Bifidobacterium,
Alistipes and unclassified Clostridiales were enriched in
normal weight subjects (Figure 3A). Previous reports have
discussed the association between the ratio of Firmicutes
to Bacteroidetes and obesity, and came to different
conclusions [24-26]. Here we observed changes within the
Bacteroidetes phylum, with an increase of Prevotella and a
decrease of Bacteroides in obese subjects. To get an
appreciation of the variability and profiles of the microbiota
across individuals, the relative abundance profiles were
plotted in area plots (Figure 3B).

Figure 2 Graphical user interface of FANTOM and examples of plots that can be generated. A) FANTOM data manipulation panel. B) Bar
graph comparing two types of patients by KEGG pathway level abundances. C) Area plot showing two sets of samples and individual profiles of
KEGG pathway abundance in each sample.

Figure 3 Comparison of healthy lean subjects with obese subjects. A) Genera with an FDR < 0.2 that are differentially abundant between lean
and obese subjects. B) Area plot of the significant species in A).

Comparisons between Spanish Crohn's disease (CD)
patients and healthy individuals in taxonomic terms are
illustrated in Figure 4A. Based on the Mann-Whitney U test
(p-value < 0.05), CD patients showed a decrease in several
species commonly present in a healthy gut, most of them
Firmicutes, such as Ruminococcus sp., Faecalibacterium sp.,
Clostridium sp., Alistipes sp., Coprococcus sp.,
Methanobrevibacter sp., Eubacterium sp., Dorea sp. and
butyrate producing bacteria. The loss of Firmicutes, and of
Faecalibacterium prausnitzii in particular, has been observed
previously [27] and is confirmed here. In addition, an
increase of several Bacteroides sp. was observed in CD
patients. By using the functional information and testing for
differential abundance of KEGG pathways between CD
patients and healthy subjects, specific metabolic pathways
could be identified, as seen in Figure 4B. The results are
consistent with the taxonomic changes: the enrichment of
the Gram negative Bacteroides sp. is in line with the
decreased number of genes for peptidoglycan biosynthesis
and ABC transporters, and with the increase in membrane
structure and transport as well as ion channels in CD
patients.

Figure 4 Taxonomic and functional differences in Crohn's disease (CD) patients compared to healthy subjects. A) Differentially abundant
species between CD and healthy subjects. B) Differentially abundant KEGG pathways between CD and healthy subjects.

Conclusion 

We provide an open source, standalone, user-friendly software
tool, FANTOM, for data analysis and data mining of
read counts from whole-genome shotgun metagenomics or
amplicon sequencing studies. FANTOM allows the user to
integrate sample metadata, taxonomy and gene functional
profiling in the analysis. It is supplied with access to
biological databases and offers the possibility to import
custom-made databases.

Availability and requirements 

Project name: FANTOM: Functional and taxonomic analysis of metagenomes

Project home page: www.sysbio.se/Fantom 

Operating system(s): Windows, Linux, Mac OSX 

Programming language: Python

Other requirements: - 

License: GNU-GPL version 3 software license 
Any restrictions to use by non-academics: No